Abstract

Hidden Markov models and their variants are the predominant sequential classification method in
such domains as speech recognition, bioinformatics and natural language processing. Being generative
rather than discriminative models, however, their classification performance is often limited. In this paper
we apply ideas from the field of density ratio estimation to bypass the difficult step of learning likelihood
functions in HMMs. By reformulating inference and model fitting in terms of density ratios and applying
a fast kernel-based estimation method, we show that it is possible to obtain a striking increase in
discriminative performance while retaining the probabilistic qualities of the HMM. We demonstrate
experimentally that this formulation makes more efficient use of training data than alternative approaches.

1 Introduction

Inference of a sequence of estimated classes from a sequence of noisy observations is fundamental in many
applications. The hidden Markov model (HMM) and its variants are the usual methods employed to do
this, and have been used with conspicuous success in such domains as speech recognition, bioinformatics and
natural language processing. As well as being computationally efficient, they are a popular choice due to
their intuitive probabilistic interpretation. However, they have drawbacks in terms of classification accuracy,
being primarily generative rather than discriminative models.

One established approach to improve classification performance in HMMs has been to adapt the
model to discriminative forms which apply information theoretic principles in training [6, 22]. This improves
classification performance, though the requirement in general to limit such models to parametric forms
means they still do not have the discriminative power of kernel-based and max-margin methods. Another
idea therefore is to create models combining the structure of the HMM and the classification approach of
Support Vector Machines [1, 19, 20]. This considerably improves classification performance, though at the
expense of losing the intuitive probabilistic interpretation of the HMM.

In this paper, we propose a different idea which combines the advantages of both the above approaches,
using concepts from the field of direct density ratio estimation [7, 16]. Our key observation is that rather
than trying to quantify a set of likelihoods (i.e. how likely any possible observation is given some class,
relative to the likelihood of making other observations given that same class), it is in fact only necessary
to know about likelihood ratios (i.e. how likely any possible observation is given some class, relative to the
likelihood of making the same observation given a different class). Because the forward-backward inference
algorithm only depends on such ratios, we follow the principle that we should avoid solving a
more general learning problem than is strictly necessary [21].

We therefore reformulate the forward-backward algorithm for HMM inference in terms of density ratios
so that the intermediate step of likelihood function estimation can be dispensed with. We demonstrate how
efficient and highly discriminative nonparametric inference can be carried out in this framework using a
kernel-based density ratio estimation procedure [4]. Because density ratios are a natural parameterization of
the forward-backward algorithm, the resulting inference procedure is also more numerically stable than the
conventional version. We demonstrate these ideas using synthetic data and on a physiological monitoring
problem. In the case of physiological monitoring, we also demonstrate an application to sequential anomaly
detection.

The structure of the rest of the paper is as follows. Section 2 reviews the conventional forward-backward
algorithm for hidden Markov model inference. Section 3 recasts this procedure in terms of ratios of probability
densities, and Section 4 describes how nonparametric estimates of those density ratios can be calculated
directly from training data in both supervised and unsupervised settings, and how estimates of several
pairwise likelihood ratio functions can be obtained from a concise set of parameters. Experimental results
in Section 5 show the performance of the density ratio HMM methodology on synthetic and real-world
physiological monitoring data, showing striking improvements compared to conventional parametric and
nonparametric sequential inference approaches.

A Matlab implementation and demo is available at
http://cit.mak.ac.ug/staff/jquinn/software/densityratioHMM.html



2 Forward-backward inference 



Consider the estimation of a latent sequence $x_{1:T} = \{x_t \in \mathcal{S} \mid t = 1, \ldots, T\}$ from an observation sequence
$y_{1:T} = \{y_t \in \mathbb{R}^{d_y} \mid t = 1, \ldots, T\}$. The variable $x_t$ is assumed to have first order Markovian dynamics, such
that $p(x_t|x_{1:t-1}) = p(x_t|x_{t-1})$, the values it can take on are a discrete set of classes or 'states' $\mathcal{S} = \{1, \ldots, S\}$,
and each observation is independently drawn from a fixed emission distribution $p(y_t|x_t)$. Given a sequence
$y_{1:T}$ and knowledge of both $p(x_t|x_{t-1})$ and $p(y_t|x_t)$ and the initial state distribution $p(x_1)$, the forward-backward
algorithm can be used to estimate the probability of each state at each time frame, $p(x_t=i|y_{1:T})$.
We use the following shorthand in describing the algorithm: $A$ denotes a matrix of state transition probabilities,
such that $A_{ij} = p(x_t=j|x_{t-1}=i)$, and $\pi$ is a vector of initial state probabilities such that $\pi_i = p(x_1=i)$.

The first stage of the algorithm involves recursively calculating forward messages $\alpha_i(t)$ for each of the
states $i \in \mathcal{S}$:

$$\alpha_i(1) = \pi_i \, p(y_1|x_1=i), \tag{1}$$

$$\alpha_i(t) = \Big[ \sum_{k \in \mathcal{S}} \alpha_k(t-1) A_{ki} \Big] \, p(y_t|x_t=i), \qquad t = 2, \ldots, T. \tag{2}$$



After each time step, it is conventional (but not necessary) to normalize the forward messages,

$$\alpha_i(t) \leftarrow c_t \, \alpha_i(t) \quad \text{s.t.} \quad \sum_{i \in \mathcal{S}} \alpha_i(t) = 1. \tag{3}$$

Without this normalization the algorithm is numerically unstable, as the values of $\alpha_i(t)$ otherwise become very
small over repeated iterations. The step is optional however because normalization occurs anyway later in
the procedure (in Eq. (7)). The normalized forward messages can be interpreted as the posterior probability of each
state given observations up to that time frame, the process of calculating this being known as filtering. Note
that the normalization means that the absolute values of $p(y_t|x_t=i)$ are not directly significant for inference;
we are ultimately interested only in the relative magnitudes.

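To carry out smoothing (calculation of $p(x_t=i|y_{1:T})$) the backward messages $\beta_i(t)$ must first be similarly
calculated:

$$\beta_i(T) = 1, \tag{4}$$

$$\beta_i(t) = \sum_{j \in \mathcal{S}} A_{ij} \, p(y_{t+1}|x_{t+1}=j) \, \beta_j(t+1), \qquad t = T-1, \ldots, 1, \tag{5}$$

also with an optional normalization step carried out after each iteration,

$$\beta_i(t) \leftarrow c'_t \, \beta_i(t) \quad \text{s.t.} \quad \sum_{i \in \mathcal{S}} \beta_i(t) = 1. \tag{6}$$

The two types of messages are then combined to give the final result

$$\gamma_i(t) = p(x_t=i|y_{1:T}) = \frac{\alpha_i(t) \beta_i(t)}{\sum_{j \in \mathcal{S}} \alpha_j(t) \beta_j(t)}. \tag{7}$$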

The forward-backward algorithm therefore requires an explicit likelihood model for every dynamical 
regime, is numerically unstable (requiring message scaling or transforms in and out of log space to prevent 
underflow), and loses information during normalization steps. We next describe how a density ratio formu- 
lation overcomes these problems, by expressing inference in terms of parameters which are more natural to 
the problem. 
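
As a point of reference for what follows, Eqs. (1)-(7) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used for the experiments; the function name `forward_backward` and the array layout (`lik[t, i]` holding $p(y_t|x_t=i)$, states indexed from 0) are assumptions made for the example.

```python
import numpy as np

def forward_backward(A, pi, lik):
    """Scaled forward-backward pass, following Eqs. (1)-(7).

    A[i, j]   = p(x_t = j | x_{t-1} = i)
    pi[i]     = p(x_1 = i)
    lik[t, i] = p(y_t | x_t = i), supplied by some explicit likelihood model
    Returns the filtering and smoothing posteriors, each of shape (T, S).
    """
    T, S = lik.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))                         # Eq. (4): beta_i(T) = 1
    alpha[0] = pi * lik[0]                         # Eq. (1)
    alpha[0] /= alpha[0].sum()                     # Eq. (3)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * lik[t]     # Eq. (2)
        alpha[t] /= alpha[t].sum()                 # Eq. (3)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (lik[t + 1] * beta[t + 1])   # Eq. (5)
        beta[t] /= beta[t].sum()                   # Eq. (6)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)      # Eq. (7)
    return alpha, gamma
```

Note that the explicit per-step rescaling is what keeps the messages away from underflow here; the ratio formulation of the next section is motivated by removing the need for it.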



3 Inference with density ratios 

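In this section we rearrange the above inference equations in terms of pairwise probability density ratios.
Ratios of probabilities make it particularly convenient to express Bayesian updates, as the ratio of two
posteriors is simply the ratio of the priors multiplied by the ratio of the likelihoods. In order to
express the forward-backward equations in this way, corresponding to the three types of values $\alpha_i$, $\beta_i$, $\gamma_i$ we
define three types of ratios $\alpha_{i,j}$, $\beta_{i,j}$, $\gamma_{i,j}$ as

$$\alpha_{i,j}(t) = \frac{\alpha_i(t)}{\alpha_j(t)}, \qquad \beta_{i,j}(t) = \frac{\beta_i(t)}{\beta_j(t)}, \qquad \gamma_{i,j}(t) = \frac{\gamma_i(t)}{\gamma_j(t)}.$$

We also treat likelihoods in terms of density ratios, and define

$$w_{i,j}(y) = \frac{p(y|x=i)}{p(y|x=j)}.$$

Using Eqs. (1) and (2) we can derive the following expressions for density ratio versions of the forward equations:

$$\alpha_{i,j}(1) = \frac{\pi_i}{\pi_j} \, w_{i,j}(y_1), \tag{8}$$

$$\alpha_{i,j}(t) = \frac{\big[ \sum_{k \in \mathcal{S}} \alpha_k(t-1) A_{ki} \big] \, p(y_t|x_t=i)}{\big[ \sum_{k' \in \mathcal{S}} \alpha_{k'}(t-1) A_{k'j} \big] \, p(y_t|x_t=j)}
= \Bigg\{ \sum_{k \in \mathcal{S}} \Big[ \sum_{k' \in \mathcal{S}} \frac{A_{k'j}}{A_{ki}} \, \alpha_{k',k}(t-1) \Big]^{-1} \Bigg\} \, w_{i,j}(y_t), \qquad t = 2, \ldots, T. \tag{9}$$

Although these expressions are written in terms of all pairwise ratios, implying that $S^2$ terms have to be
calculated at each time frame, in fact these ratios can be estimated with a concise set of parameters as we
demonstrate in Section 4.

For backward messages we use Eqs. (4) and (5) in a similar way to obtain $\beta_{i,j}(T) = 1$ and

$$\beta_{i,j}(t) = \sum_{k \in \mathcal{S}} \Big[ \sum_{k' \in \mathcal{S}} \frac{A_{jk'}}{A_{ik}} \, w_{k',k}(y_{t+1}) \, \beta_{k',k}(t+1) \Big]^{-1}, \qquad t = T-1, \ldots, 1. \tag{10}$$

The messages are finally combined simply with

$$\gamma_{i,j}(t) = \alpha_{i,j}(t) \, \beta_{i,j}(t). \tag{11}$$

The filtering and smoothing probabilities, if needed, can be simply calculated from the ratios with
$\alpha_i(t) = \alpha_{i,s}(t) / \sum_{k \in \mathcal{S}} \alpha_{k,s}(t)$ and $\gamma_i(t) = \gamma_{i,s}(t) / \sum_{k \in \mathcal{S}} \gamma_{k,s}(t)$
for any fixed reference state $s \in \mathcal{S}$. This completes the density ratio forward-backward algorithm.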
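
The ratio recursions can be sketched as follows. This is an illustrative NumPy sketch, not the authors' Matlab code: the function name, the array layout `W[t, i, j]` for $w_{i,j}(y_t)$ and the choice of the last state as reference are assumptions. For brevity the updates are written in the algebraically equivalent form obtained by dividing numerator and denominator by the reference state's message, and $\pi$ is assumed strictly positive.

```python
import numpy as np

def ratio_forward_backward(A, pi, W):
    """Forward-backward inference using only likelihood ratios.

    A[i, j]    = p(x_t = j | x_{t-1} = i)
    pi[i]      = p(x_1 = i), assumed strictly positive here
    W[t, i, j] = w_{i,j}(y_t) = p(y_t | x_t = i) / p(y_t | x_t = j)
    Returns filtering and smoothing posteriors, each of shape (T, S).
    """
    T, S, _ = W.shape
    ref = S - 1                          # reference state (an arbitrary choice)
    a = np.zeros((T, S, S))              # a[t, i, j] = alpha_{i,j}(t)
    b = np.ones((T, S, S))               # b[t, i, j] = beta_{i,j}(t); all ones at t = T
    a[0] = np.outer(pi, 1.0 / pi) * W[0]                   # forward initialisation
    for t in range(1, T):
        # pred[i] is proportional to sum_k alpha_k(t-1) A_ki
        pred = a[t - 1][:, ref] @ A
        a[t] = np.outer(pred, 1.0 / pred) * W[t]           # forward ratio update
    for t in range(T - 2, -1, -1):
        # msg[i] is proportional to beta_i(t)
        msg = A @ (W[t + 1][:, ref] * b[t + 1][:, ref])
        b[t] = np.outer(msg, 1.0 / msg)                    # backward ratio update
    g = a * b                                              # gamma_{i,j} = alpha_{i,j} beta_{i,j}
    filt = a[:, :, ref] / a[:, :, ref].sum(axis=1, keepdims=True)
    smooth = g[:, :, ref] / g[:, :, ref].sum(axis=1, keepdims=True)
    return filt, smooth
```

No individual likelihood value is ever required; if likelihoods happen to be known, `W` can be formed as `lik[:, :, None] / lik[:, None, :]`, which gives a convenient way to check the sketch against the conventional algorithm.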

To provide some intuition about the ratio-based procedure we look in some more detail at a two-state
HMM, $\mathcal{S} = \{1, 2\}$. The forward equation in this simplified case reduces to:

$$\alpha_{1,2}(t) = \frac{A_{11} \, \alpha_{1,2}(t-1) + A_{21}}{A_{12} \, \alpha_{1,2}(t-1) + A_{22}} \, w_{1,2}(y_t), \tag{12}$$

and the backward equation to

$$\beta_{1,2}(t) = \frac{A_{11} \, \beta_{1,2}(t+1) \, w_{1,2}(y_{t+1}) + A_{12}}{A_{21} \, \beta_{1,2}(t+1) \, w_{1,2}(y_{t+1}) + A_{22}}. \tag{13}$$

The first term of (12) specifies the evolution from $\frac{p(x_{t-1}=1|y_{1:t-1})}{p(x_{t-1}=2|y_{1:t-1})}$ to $\frac{p(x_t=1|y_{1:t-1})}{p(x_t=2|y_{1:t-1})}$ in terms of the transition
probabilities in $A$. Hence this is simply the ratio form of marginalizing out the transition probabilities. Figure 1
shows this evolution of the state probability ratio given transition matrices of the form
$A = \begin{bmatrix} a & 1-a \\ 1-a & a \end{bmatrix}$
for different values of $a$. If $a = 0.5$, then any past information is lost; $\frac{p(x_t=1|y_{1:t-1})}{p(x_t=2|y_{1:t-1})}$ is always equal to 1,
hence both states are equally likely a priori at time $t$. If $a = 1$ there can be no state transitions. Inference
is then equivalent to accumulating evidence in favor of either state 1 or 2 globally, and Eq. (12) reduces to
$\alpha_{1,2}(t) = \frac{\pi_1}{\pi_2} \prod_{t'=1}^{t} w_{1,2}(y_{t'})$. The ratio form of the Bayesian update given the observation is simply
multiplication by $w_{1,2}(y_t)$.

The ratio formulation of the forward-backward algorithm is numerically stable, and Figure 1 provides an
intuition of this; extreme ratio values tend to be mapped back to within a few orders of magnitude of unity
after considering transition probabilities. Scaling of values to prevent underflow is therefore not necessary
with this method as it is in the conventional forward-backward algorithm.
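
The stabilising effect of the transition step can be checked numerically. The sketch below evaluates the transition part of the two-state forward update for a symmetric transition matrix with self-transition probability $a$; the function name `transition_ratio_map` is an assumption made for illustration.

```python
import numpy as np

def transition_ratio_map(r, a):
    """Map the ratio r = p(x_{t-1}=1 | y_{1:t-1}) / p(x_{t-1}=2 | y_{1:t-1})
    to the predicted ratio at time t, for A = [[a, 1-a], [1-a, a]].
    This is the first factor of the two-state forward update."""
    return (a * r + (1.0 - a)) / ((1.0 - a) * r + a)

# Ratios spanning 24 orders of magnitude.
r = np.logspace(-12, 12, 7)
for a in (0.5, 0.9, 0.99, 1.0):
    print(a, transition_ratio_map(r, a))
```

For $0.5 < a < 1$ the map is increasing in $r$ and its output always lies between $(1-a)/a$ and $a/(1-a)$, so however extreme the incoming ratio is, the predicted ratio stays within a bounded range; only in the degenerate case $a = 1$ is the ratio passed through unchanged.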

Hence, given a way to estimate $w_{i,j}(\cdot)$ from training data, it is possible to carry out inference in the HMM
without ever needing to calculate the individual likelihoods. These steps are mathematically equivalent to
the standard forward-backward algorithms, but do not require any normalization step. If the observation
distribution is known exactly then the ratio formulation is equivalent and has no advantage. However, when
the observation distribution needs to be approximated in some way, which is almost always the case with
complex real-world data, the parameterization in terms of $w_{i,j}(\cdot)$ is more natural to the sequential inference
problem than attempting to approximate each of the $p(y_t|x_t=i)$ distributions directly. We discuss ways in
which the estimation of these ratios and other parameters can be carried out in the next section.



4 Parameter estimation 



The parameters to be learned in the density ratio HMM model are the likelihood ratio functions $w_{i,j}(y)$,
the transition matrix $A$ and initial class probabilities $\pi$. In this section we first suggest effective methods
for estimating $w_{i,j}(y)$ which bypass the difficult step of estimating the densities $p(y|x=i)$ and $p(y|x=j)$
individually. In Section 4.1 we introduce an efficient method for carrying this out when there are two classes
to be modeled, and give a related method for several classes in Section 4.2 which does not need every pairwise
ratio to be estimated individually. The estimation of $A$ and $\pi$ is then discussed in Section 4.3.

Figure 1: Forward probability ratio evolution function given different transition matrices, for a 2-state HMM.
See text for details.



4.1 Two-class density ratio estimation 

We begin by discussing how to estimate a likelihood ratio from data in the case that there are only two
states $i$ and $j$ in the model, for example as would be needed to calculate (12) and (13). There is just one
ratio to learn, since $w_{i,j} = w_{j,i}^{-1}$. Several techniques have been developed for direct density ratio estimation
[10, 2, 3, 7, 18, 17]. We use a least squares approach here [4] which yields a consistent estimator with very
good computational efficiency. The estimator is of the following form:

$$\hat{w}_{i,j}(y) = \theta_{ij}^\top \phi(y),$$

where $\theta_{ij} \in \mathbb{R}^B$ for some number of parameters $B$, and

$$\phi(y) = \big( K(y, y_1), \ldots, K(y, y_B) \big)^\top \in \mathbb{R}^B \tag{14}$$

is a vector of kernel basis functions. We can set $B = N$ to have a kernel basis function at every training point,
or for $B < N$ use some random subset of the training points. In this work we use the squared exponential
kernel $K(y, y') = \exp\big( -\|y - y'\|^2 / (2\sigma^2) \big)$.

This model can be fitted using a squared loss objective function,

$$J_{ij}(\theta_{ij}) = \frac{1}{2} \int \big( \hat{w}_{i,j}(y) - w_{i,j}(y) \big)^2 \, p(y|x=j) \, dy.$$

Expanding the squared term and using $w_{i,j}(y) \, p(y|x=j) = p(y|x=i)$ we obtain

$$J_{ij}(\theta_{ij}) = \frac{1}{2} \int \hat{w}_{i,j}(y)^2 \, p(y|x=j) \, dy - \int \hat{w}_{i,j}(y) \, p(y|x=i) \, dy + C,$$

where $C$ is a constant term that does not depend on any of the $\theta_{ij}$ values. Empirically, we can approximate
the expectations by sample averages. Ignoring the constant $C$ and the factor $1/N$, and including an $\ell_2$-regularizer,
we have the following training criterion:

$$\hat{J}_{ij}(\theta_{ij}) = \frac{1}{2} \theta_{ij}^\top \Phi^\top M_j \Phi \, \theta_{ij} - \theta_{ij}^\top \Phi^\top m_i + \frac{\rho}{2} \theta_{ij}^\top \theta_{ij},$$

where $\Phi = (\phi(y_1), \ldots, \phi(y_N))^\top$, $m_i$ is a column vector indicating membership of class $i$ such that the $n$th
element is one if $x_n = i$ and zero otherwise, and $M_j$ is a square matrix with $m_j$ along the diagonal and other
entries set to zero. $\hat{J}_{ij}(\theta_{ij})$ is minimized by

$$\hat{\theta}_{ij} = \big( \Phi^\top M_j \Phi + \rho I_B \big)^{-1} \Phi^\top m_i. \tag{15}$$

We select $\rho$ and $\sigma$ with cross validation. Because of the nature of the estimator, it is sometimes possible
to obtain negative values for $\hat{w}_{i,j}(y)$. We simply round up negative estimates to zero in such cases, which
does not affect the consistency of the estimator [5]. This least-squares approach is very fast to compute in
practice, finding a global optimum in a single step with no iterative parameter search required.
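
A minimal sketch of this closed-form fit is given below. The helper names (`sq_exp_kernel`, `fit_ratio`, `predict_ratio`) and the way kernel centres are passed in are assumptions made for illustration; the kernel width and regulariser are fixed by hand here rather than chosen by cross validation.

```python
import numpy as np

def sq_exp_kernel(Y, centers, sigma):
    """Squared exponential kernel between rows of Y (N x d) and centers (B x d)."""
    d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_ratio(Y, x, i, j, centers, sigma, rho):
    """Regularised least-squares fit of the ratio parameters theta_ij."""
    Phi = sq_exp_kernel(Y, centers, sigma)        # N x B design matrix
    m_i = (x == i).astype(float)                  # membership indicator of class i
    m_j = (x == j).astype(float)                  # membership indicator of class j
    H = Phi.T @ (m_j[:, None] * Phi)              # Phi^T M_j Phi
    h = Phi.T @ m_i                               # Phi^T m_i
    return np.linalg.solve(H + rho * np.eye(len(centers)), h)

def predict_ratio(theta, Y_new, centers, sigma):
    """Estimated ratio at new points, with negative estimates rounded up to zero."""
    return np.maximum(0.0, sq_exp_kernel(Y_new, centers, sigma) @ theta)
```

Because the whole fit is a single linear solve, repeating it over a grid of kernel widths and regularisation strengths for cross validation is inexpensive.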



4.2 Extension to multiple classes 

In problems where $S > 2$, it would be possible to simply learn several pairwise likelihood ratio functions.
Because $w_{i,j} = w_{j,i}^{-1}$ and $w_{i,i} = 1$, this would entail learning $S(S-1)/2$ ratio functions. Fewer still need to be
estimated if relationships of the form $w_{i,j} = w_{i,k} / w_{j,k}$ are used, though this 'ratio of ratios' may become
unstable when the denominator $w_{j,k}$ is close to zero.

For multiple-class ratio estimation another approach is to directly estimate conditional probabilities of
class $x$ given i.i.d. samples $\{y^{(1)}, \ldots, y^{(N)}\}$. Using the fact that
$p(x=i|y) = \frac{p(y|x=i) \, p(x=i)}{\sum_{j \in \mathcal{S}} p(y|x=j) \, p(x=j)}$, likelihood
ratios in our problem can be estimated as

$$w_{i,j}(y) = \frac{p(y|x=i)}{p(y|x=j)} = \frac{p(x=j)}{p(x=i)} \cdot \frac{p(x=i|y)}{p(x=j|y)}. \tag{16}$$

This is similar in principle to the scheme of using classifiers to calculate likelihoods in standard HMMs
discussed in [9].

A standard approach to calculating the $p(x_t=i|y)$ terms is kernel logistic regression. This can be
computationally demanding for large datasets though, so we use a least squares procedure [15] similar in form
to the two-class estimator we introduced in the previous section, and which is known to give comparable
accuracy to kernel logistic regression while requiring far less training time.

We construct functions $q(x=i|y, \theta_i)$ to estimate $p(x_t=i|y)$, defined as

$$q(x=i|y, \theta_i) = \theta_i^\top \phi(y),$$

where $\theta_i \in \mathbb{R}^B$ and $\phi(y)$ is the same as in Eq. (14). The squared loss term in this case is

$$J_i(\theta_i) = \frac{1}{2} \int \big( q(x=i|y, \theta_i) - p(x=i|y) \big)^2 \, p(y) \, dy.$$

Expanding and using $p(x|y) = p(y|x) \, p(x) / p(y)$ we obtain

$$J_i(\theta_i) = \frac{1}{2} \int q(x=i|y, \theta_i)^2 \, p(y) \, dy - \int q(x=i|y, \theta_i) \, p(y|x=i) \, p(x=i) \, dy + C,$$

which can be empirically approximated to give the following training criterion:

$$\hat{J}_i(\theta_i) = \frac{1}{2} \theta_i^\top \Phi^\top \Phi \, \theta_i - \theta_i^\top \Phi^\top m_i + \frac{\rho}{2} \theta_i^\top \theta_i.$$

$\hat{J}_i(\theta_i)$ is minimized by

$$\hat{\theta}_i = \big( \Phi^\top \Phi + \rho I_B \big)^{-1} \Phi^\top m_i, \tag{17}$$

which is essentially kernel ridge regression. As in the two-class case, we select $\rho$ and $\sigma$ with cross validation.
Also as before we round up negative estimates to zero, $q(x=i|y, \theta_i) = \max\big( 0, \theta_i^\top \phi(y) \big)$, though we note
that negative estimates are unlikely when the number of training samples is large enough [15].

Now using Eq. (16) we are able to calculate likelihood ratios for any pair of states using

$$\hat{w}_{i,j}(y_t) = \frac{n_j}{n_i} \cdot \frac{q(x_t=i|y_t, \theta_i)}{q(x_t=j|y_t, \theta_j)},$$

where $n_i$ is the number of training samples labeled $x_t = i$. Just as in the two-class case, Eq. (17) can be
computed very quickly.
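
The multi-class procedure can be sketched as follows, fitting all $S$ parameter vectors in one solve and then forming every pairwise ratio from them. The function names and the zero-based state labels are assumptions for illustration. The small positive `floor` applied to the class-posterior estimates is also an assumption of this sketch (it avoids division by zero after negative estimates are clipped); it is not part of the procedure described in the text.

```python
import numpy as np

def sq_exp_kernel(Y, centers, sigma):
    """Squared exponential kernel between rows of Y (N x d) and centers (B x d)."""
    d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_class_posteriors(Phi, x, S, rho):
    """Ridge-regression fit of one parameter vector per class.
    x holds integer labels 0..S-1; returns Theta of shape (B, S)."""
    B = Phi.shape[1]
    M = np.eye(S)[x]                        # N x S; column i is the indicator m_i
    return np.linalg.solve(Phi.T @ Phi + rho * np.eye(B), Phi.T @ M)

def pairwise_ratios(Phi_new, Theta, counts, floor=1e-12):
    """W[t, i, j] = (n_j / n_i) * q(x=i | y_t) / q(x=j | y_t)."""
    q = np.maximum(Phi_new @ Theta, floor)  # clip negatives; floor is an assumption
    prior_term = counts[None, None, :] / counts[None, :, None]   # n_j / n_i
    return (q[:, :, None] / q[:, None, :]) * prior_term
```

Only $S$ parameter vectors are stored, yet all $S^2$ pairwise ratios are available at each time frame, in the array layout expected by a ratio-based forward-backward routine.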

4.3 Setting other parameters 

We now discuss setting parameters $A$, $\pi$ in the density ratio HMM given training data, first when states
$x_{1:T}$ are available and second as an unsupervised learning problem when we only have observation sequences
$y_{1:T}$.

If $x_{1:T}$ is observed in training data, $p(x_t=i|x_{t-1}=j)$ and $p(x_1=i)$ can be estimated directly by frequency.
The process of setting the remaining parameters $A$, $\pi$ in the case that the labels $x_{1:T}$ are present in the
training data is therefore identical to the standard HMM; we refer the reader to the details in [12].
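
For the supervised case, estimation by frequency amounts to counting. A minimal sketch, assuming zero-based integer labels, a hypothetical function name `estimate_transitions`, and that every state occurs at least once as a non-final label (otherwise a row of counts would be all zeros):

```python
import numpy as np

def estimate_transitions(sequences, S):
    """Frequency estimates of A[i, j] = p(x_t = j | x_{t-1} = i) and pi[i] = p(x_1 = i)
    from a list of labelled state sequences with labels 0..S-1."""
    counts = np.zeros((S, S))
    init = np.zeros(S)
    for seq in sequences:
        x = np.asarray(seq)
        init[x[0]] += 1
        np.add.at(counts, (x[:-1], x[1:]), 1)    # one count per observed i -> j step
    A = counts / counts.sum(axis=1, keepdims=True)
    pi = init / init.sum()
    return A, pi

A_hat, pi_hat = estimate_transitions([[0, 0, 1, 1, 0], [1, 1, 1]], S=2)
```

In practice a small pseudo-count is often added to `counts` so that transitions unseen in training are not assigned probability exactly zero; that smoothing is a common convention rather than something specified in the text.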

For the unsupervised case in which only the observation sequence $y_{1:T}$ is available for training, learning
can be carried out by iterating between (1) a likelihood-maximization step to update $A$, $\pi$, $\{\theta_i\}$ given
estimates of $p(x_{1:T}|y_{1:T})$ and (2) an expectation step, the inference procedure given in Section 3. To do this some initial
estimate of $x_{1:T}$ is required as a starting point, which could be obtained by running a standard non-dynamic
clustering procedure such as $k$-means on the observations. The iterations are continued until convergence or
some limit on the number of cycles is reached. For unsupervised learning we require one alteration to the
parameter estimation procedure for $w_{i,j}(\cdot)$ in order to accommodate soft estimates of state probabilities $\gamma_i$
rather than hard labels. The weighted version of Eq. (17) is

$$\hat{\theta}_i = \big( \Phi^\top \Gamma^{(i)} \Phi + \rho I_B \big)^{-1} \Phi^\top \big[ \gamma_i(1), \ldots, \gamma_i(N) \big]^\top,$$

where $\Gamma^{(i)}$ is an $N \times N$ diagonal matrix such that $\Gamma^{(i)}_{t,t} = \gamma_i(t)$. The maximization step updates for $A$, $\pi$ given
$\gamma_i$ are again identical to the standard HMM case.



5 Experiments 

We now give experimental results using the above methods when applied to both synthetic and real-world 
datasets. 

5.1 Sequential classification on toy data 

To give an illustration of HMM inference using density ratio estimation, we generated data from a simple
switching model with three dynamical regimes. The transition probabilities in the model were set to

$$A = \begin{bmatrix} .98 & .01 & .01 \\ .01 & .98 & .01 \\ .01 & .01 & .98 \end{bmatrix},$$

with initial state probabilities $\pi = [\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}]$, and these parameters were then used to
generate sequences $x_{1:T}$. Scalar sequences $y_{1:T}$ were then generated conditioned on $x_t$ as follows:

$$y_t = \begin{cases} \sin(0.2t) + \eta_t & x_t = 1 \\ \sin(0.4t) + \eta_t & x_t = 2 \\ \sin(0.6t) + \eta_t & x_t = 3 \end{cases}$$

with $\eta_t \sim \mathcal{N}(0, 0.25)$. A sample sequence is shown in Figure 2 (top left). To construct the vector sequence
$\mathbf{y}_{1:T}$ used for testing inference, we took subsequences of $y_{1:T}$ using a sliding window of length $d_y$ (using
$d_y = 4$ for this example) such that $\mathbf{y}_t = y_{t-d_y+1:t}$, for $t \geq d_y$.

[Figure 2; recovered panel title: "Filtering, Accuracy=0.899"]

Figure 2: Inference example on noisy sine wave sequence. The top left panel shows a sampled sequence, which
switches randomly between three dynamic regimes (sine waves of three different frequencies with Gaussian
noise added). In the bottom left panel, the blue line shows the true value of $x_t$, and the red crosses show the
estimated most likely value given density ratio estimates independently at each time frame. The top right
panel shows the inference improving when forward dynamics are incorporated; the bottom right panel shows
further improvement when both forward and backward dynamics are incorporated.
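
The data generation can be reproduced along the following lines. This is a sketch under stated assumptions: the function name and seed handling are illustrative, states are indexed from 0 rather than 1, and the second argument of $\mathcal{N}(0, 0.25)$ is read as a variance (standard deviation 0.5).

```python
import numpy as np

def sample_switching_sines(T, d_y=4, seed=0):
    """Sample a state sequence and a noisy switching sine wave, then build
    sliding-window observation vectors of length d_y."""
    rng = np.random.default_rng(seed)
    A = np.full((3, 3), 0.01) + 0.97 * np.eye(3)    # 0.98 on the diagonal, 0.01 elsewhere
    pi = np.full(3, 1.0 / 3.0)
    freqs = np.array([0.2, 0.4, 0.6])
    x = np.zeros(T, dtype=int)
    x[0] = rng.choice(3, p=pi)
    for t in range(1, T):
        x[t] = rng.choice(3, p=A[x[t - 1]])
    t_idx = np.arange(1, T + 1)
    y_scalar = np.sin(freqs[x] * t_idx) + rng.normal(0.0, 0.5, size=T)
    # Row n of Y_vec holds y_scalar[n : n + d_y]; its label is the state at the window's end.
    Y_vec = np.lib.stride_tricks.sliding_window_view(y_scalar, d_y)
    return x[d_y - 1:], Y_vec, y_scalar

states, Y_vec, y_scalar = sample_switching_sines(1000)
```

The returned `states` and `Y_vec` are aligned, so they can be used directly as labelled training data for the ratio estimators or as a test sequence for inference.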

Figure 2 (bottom left) shows the results of using density ratios at each time step to find the most likely
class independently at each time step, by finding $\arg\max_i \big( \sum_{j \in \mathcal{S}} \hat{w}_{j,i}(\mathbf{y}_t) \big)^{-1}$. This is equivalent to filtering
inference using a uniform transition matrix $A_{ij} = 1/S$, and gives an idea of how informative the subsequences
$\mathbf{y}_t$ are about $x_t$ at individual time frames when no dynamical information is incorporated. In this plot the
blue line shows the true values of $x_{1:T}$ in the sample sequence, and the red crosses show the MAP estimates.
The top right panel shows MAP estimates with filtering inference using the forward equations only. The
bottom right panel shows MAP estimates with smoothing inference using both forward and backward steps.

Inference results were compared to those from two alternative models. The first was an alternative
nonparametric hidden Markov model, using kernel density estimation (using a squared exponential kernel,
with parameters chosen by cross-validation) to model $p(y_t|x_t=i)$ for each $i \in \mathcal{S}$, as proposed in
[8]. The second was the conventional Gaussian mixture model approach to modeling the observation density
of each regime [12], with the number of components of the mixture selected in each case being that which
minimized the Bayesian Information Criterion (BIC) on training data. Training sequences were randomly
generated from the above switching noisy sine wave example, of lengths between 50 and 500 subsequences for
each of the three dynamic regimes. Test sequences of length 1000 were also generated as above. The density
ratio HMM, KDE-HMM and GMM-HMM were trained and applied to the test sequence. The accuracy,
calculated as the proportion of time frames for which the MAP estimate $\arg\max_i p(x_t=i|y_{1:T})$ was equal to the
true simulated state $x_t$, was calculated for each method. This was repeated 500 times for each training
sample size. Figure 3 shows the mean and standard deviation of accuracy for filtering and smoothing with
all three methods. The density ratio HMM has consistently higher average accuracy than the other methods,
particularly for small training set sizes.

Figure 3: Accuracy of state estimates in noisy switching sine wave problem, mean and standard deviation
over 500 runs per training sample size.

5.2 Physiological monitoring 

We next evaluated the density ratio HMM using time series of physiological measurements from premature 
infants receiving intensive care. The dataset we used consists of 24 hour sequences of monitoring data from 
15 babies, a total of 360 hours of data with measurements taken at one second intervals. The measurements 
are of vital signs (heart rate, blood pressures, temperatures, blood gas concentrations) and environmental 
measurements (incubator temperature and humidity). The data is annotated with the occurrences of four 
common phenomena: bradycardia (temporary slowing or stopping of the heart), opening of the incubator, 
the taking of a blood sample, and disconnection of the core temperature probe. Any period in the data 
during which there was some clinically significant change not covered by one of the four conditions above 
was annotated as a fifth class. Finally, for each baby a period of around 10 minutes was annotated as 
'normal', i.e. representative of that baby's baseline physiology. The data is publicly available and described
in [11].

We first trained a set of two-class density ratio HMMs with this data, treating each state in the annotation 
as a separate inference problem. To train the bradycardia model for a particular baby, for example, we would 
take the period for that baby annotated as normal as training data for the first state, and periods annotated 
as bradycardia from other babies as the training data for the second state. 

We compared the output of the density ratio HMM to that of a factorial hidden Markov model (FHMM)
and factorial switching linear dynamical system (SLDS), recreating the evaluation of [11]. As the
problem associated with this dataset is real-time patient monitoring, we applied filtering inference only. The
latter methods were developed using extensive domain knowledge as to the physical processes underlying the
observations. The evaluation was done using 3-fold cross validation on the set of 15 sequences. Area under
ROC curve (AUC) and equal error rate (EER) for each of the classes in the annotations were calculated
for the three methods, shown in Table 1. The density ratio HMM gives either equivalent or superior
results in all cases, for example achieving a 17 percentage point increase in AUC and a 13 percentage point
decrease in EER for detection of temperature probe disconnection compared to the next best method.



                      FHMM    SLDS    DR-HMM

Bradycardia     AUC   .66     .88     .92*
                EER   .37     .25     .13*

Incu. Open      AUC   .78     .87     .88*
                EER   .25     .17*    .17*

Blood Sample    AUC   .82     .96*    .96*
                EER   .20     .14     .05*

Temp. Probe     AUC   .74     .77     .94*
                EER   .32     .23     .10*

Abnormal        AUC   -       .69     .75*
                EER   -       .36     .31*

Table 1: Classification accuracy on neonatal intensive care unit time series data, for occurrences of
bradycardia, opening of the incubator, blood sampling, temperature probe disconnection and "other significant
deviation from normal dynamics". FHMM denotes the factorial hidden Markov model, SLDS denotes a
switching linear dynamical system, and DR-HMM denotes the density ratio hidden Markov model. The
highest AUC and lowest EER in each row are marked with an asterisk (ties are both marked).



Anomaly detection 

We also constructed an anomaly detection model for each test sequence, to assess the ability of this framework
to identify any clinically significant deviation from known types of physiological variation. Using the multiple-class
density ratio estimation procedure in Section 4.2 this can be modeled quite easily.¹

Assuming that observations from outlier classes might be present in test data, we use $x_t = *$, $* \notin \mathcal{S}$, to
denote any such class at time $t$. The method we propose for discriminating outliers from inliers is similar in
essence to the one-class support vector machine [14] and the kernel Fisher discriminant method for outlier
detection [13]. These methods are based on the assumption that outliers occupy low-density regions of the
data space and that a kernel model can be used to characterize the high-density regions given training data.
Any given significance threshold can then be used to separate the inlier and outlier level sets.
We estimate the conditional probability of an outlier $p(x=*|y)$ with

$$q(x=*|y, \theta_*) = 1 - \theta_*^\top \phi(y). \tag{18}$$

The problem of identifying outliers can then be equated with learning $\theta_*$, such that Eq. (18) is close to zero
when $y$ is within a region in which training data has high density, and is close to one anywhere else. To
achieve this we minimize the following loss function:

$$J_*(\theta_*) = \frac{1}{2} \int \big( 1 - \theta_*^\top \phi(y) \big)^2 \, p(y) \, dy + \frac{\rho}{2} \|\theta_*\|^2. \tag{19}$$

The solution to this, using the same empirical approximation as in Eq. (17), is given by

$$\hat{\theta}_* = \big( \Phi^\top \Phi + \rho I_B \big)^{-1} \Phi^\top \mathbf{1}_N = \sum_{i \in \mathcal{S}} \hat{\theta}_i,$$

where $\mathbf{1}_N$ is a vector of $N$ ones and the second equality holds because $\sum_{i \in \mathcal{S}} m_i = \mathbf{1}_N$.
¹ An outlier detection method which could be used with the two-class ratio estimation method in Section 4.1 is
described in [4].
Therefore

$$q(x=*|y, \theta_1, \ldots, \theta_S) = 1 - \sum_{i \in \mathcal{S}} \theta_i^\top \phi(y).$$

The parameters learned to model the inlier classes can therefore also be used for outlier detection.
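
The reuse of inlier parameters for outlier scoring can be sketched as below. The function names, the toy two-cluster data and the clipping of the score to $[0, 1]$ are assumptions made for this illustration; `Theta` is fitted with the ridge-regression form of the multi-class estimator.

```python
import numpy as np

def sq_exp_kernel(Y, centers, sigma):
    """Squared exponential kernel between rows of Y (N x d) and centers (B x d)."""
    d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def outlier_probability(Phi_new, Theta):
    """q(x=* | y) = 1 - sum_i theta_i^T phi(y); clipping to [0, 1] is an assumption."""
    return np.clip(1.0 - (Phi_new @ Theta).sum(axis=1), 0.0, 1.0)

# Toy inlier data: two classes clustered around -2 and +2.
rng = np.random.default_rng(0)
x = np.repeat([0, 1], 50)
Y = (np.array([-2.0, 2.0])[x] + rng.normal(0.0, 0.5, 100))[:, None]
Phi = sq_exp_kernel(Y, Y, sigma=1.0)
Theta = np.linalg.solve(Phi.T @ Phi + 0.1 * np.eye(100), Phi.T @ np.eye(2)[x])

# Score one point inside an inlier cluster and one point far from all training data.
queries = np.array([[2.0], [30.0]])
scores = outlier_probability(sq_exp_kernel(queries, Y, sigma=1.0), Theta)
```

Far from the training data every kernel basis function is close to zero, so the score approaches one; inside a dense region the summed class estimates are close to one and the score approaches zero.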

To apply this to the problem of anomaly detection in the physiological monitoring dataset, we trained
inlier class parameters using the reference period of the test sequence annotated as normal as well as any
subsequence of training data annotated as bradycardia, incubator opening, blood sampling or temperature
probe disconnection. Inference of $p(x_{1:T} = *|y_{1:T})$ in test data therefore gave an estimate of whether any
significant physiological changes not consistent with any of the known dynamical regimes were occurring.

We compared this to the anomaly detection method based on the SLDS in [11]. Again using
3-fold cross-validation on the 360 hours of annotated data, the performance of our method was found to be
superior to that of the SLDS, with a 6 percentage point increase in AUC and a 5 percentage point decrease in EER.

6 Conclusions 

In this paper we have demonstrated that direct density ratio estimation methods can be applied to sequen- 
tial inference problems, improving the discriminative capability of HMMs without losing the probabilistic 
interpretation of such models. We believe this method is particularly effective when no parametric class- 
conditional distribution of observations is known a priori, which is usually the case in real-world problems. 
As density ratio estimation is a growing field, having already been successfully applied to various problems 
in statistical inference such as covariate shift and outlier detection, the ideas in this paper make further ad- 
vances in the field directly applicable to sequence modeling. It would be possible to apply this same principle 
to several other related sequential probabilistic models, another significant direction for future work. 

References 

[1] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In Proceedings
of the 20th International Conference on Machine Learning, 2003.

[2] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test
distributions. In Proceedings of the 24th Annual International Conference on Machine Learning (ICML2007),
pages 81-88, 2007.

[3] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift by
kernel mean matching. In J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. Lawrence,
editors, Dataset Shift in Machine Learning, pages 131-160, Cambridge, MA, USA, 2009. MIT Press.

[4] T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation.
Journal of Machine Learning Research, 10:1391-1445, Jul. 2009.

[5] T. Kanamori, T. Suzuki, and M. Sugiyama. Statistical analysis of least-squares density ratio
estimation. Machine Learning, 86(3):335-367, 2012.

[6] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction
and segmentation. In Proceedings of the Seventeenth International Conference on Machine Learning,
volume 951, pages 591-598, 2000.

[7] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood
ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, 2010.

[8] M. Piccardi and O. Perez. Hidden Markov models with kernel density estimation of emission probabilities
and their use in activity recognition. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2007.

[9] V. Punyakanok and D. Roth. The use of classifiers in sequential inference. In Advances in Neural
Information Processing Systems, volume 14, 2001.

[10] J. Qin. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika,
85(3):619-630, 1998.

[11] J. A. Quinn, C. K. I. Williams, and N. McIntosh. Factorial switching linear dynamical systems applied to
physiological condition monitoring. IEEE Transactions on Pattern Analysis and Machine Intelligence,
31:1537-1551, 2009.

[12] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition.
Proceedings of the IEEE, 77(2):257-286, 1989.

[13] V. Roth. Kernel Fisher discriminants for outlier detection. Neural Computation, 18(4):942-960, 2006.

[14] B. Schölkopf, R. Williamson, A. Smola, and J. Shawe-Taylor. SV estimation of a distribution's support.
In Advances in Neural Information Processing Systems, volume 12, 1999.

[15] M. Sugiyama, H. Hachiya, M. Yamada, J. Simm, and H. Nam. Least-squares probabilistic classifier: A
computationally efficient alternative to kernel logistic regression. In Proc. International Workshop on
Statistical Machine Learning for Speech Processing (IWSML2012), pages 1-10, 2012.

[16] M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge
University Press, Cambridge, UK, 2012.

[17] M. Sugiyama, T. Suzuki, and T. Kanamori. Density ratio matching under the Bregman divergence:
A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics,
64(5):1009-1044, 2012.

[18] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance
estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics,
60(4):699-746, 2008.

[19] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information
Processing Systems, volume 16, 2004.

[20] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and
interdependent output variables. Journal of Machine Learning Research, 6(2):1453, 2006.

[21] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

[22] P. C. Woodland and D. Povey. Large scale discriminative training of hidden Markov models for speech
recognition. Computer Speech & Language, 16(1):25-47, 2002.



